Timing the evolution of phosphorus-cycling enzymes through geological time using phylogenomics

Phosphorus plays a crucial role in controlling biological productivity, but geological estimates of phosphate concentrations in the Precambrian ocean, during life’s origin and early evolution, vary over several orders of magnitude. While reduced phosphorus species may have served as alternative substrates to phosphate, their bioavailability on the early Earth remains unknown. Here, we reconstruct the phylogenomic record of life on Earth and find that phosphate transporting genes (pnas) evolved in the Paleoarchean (ca. 3.6-3.2 Ga) and are consistent with phosphate concentrations above modern levels ( > 3 µM). The first gene optimized for low phosphate levels (pstS; <1 µM) appeared around the same time or in the Mesoarchean depending on the reconstruction method. Most enzymatic pathways for metabolising reduced phosphorus emerged and expanded across the tree of life later. This includes phosphonate-catabolising CP-lyases, phosphite-oxidising pathways and hypophosphite-oxidising pathways. CP-lyases are particularly abundant in dissolved phosphate concentrations below 0.1 µM. Our results thus indicate at least local regions of declining phosphate levels through the Archean, possibly linked to phosphate-scavenging Fe(III), which may have limited productivity. However, reduced phosphorus species did not become widely used until after the Paleoproterozoic Great Oxidation Event (2.3 Ga), possibly linked to expansion of the biosphere at that time.

Figure 1: Uncertainty in estimating the origin of pstS.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).losses (triangles) and speciations (asterisks) of genes associated with microbial phosphate import (yellow), phosphonate catabolism (pink), phosphonate production (green), phosphite import and oxidation (blue) and hypophosphite import and oxidation (black).Faded colours represent the midpoint of events which occur on terminal branches, whilst bright colours represent the midpoint of events which occur on internal branches.Mya, million years ago; GOE, great oxygenation event.The constraint ensures that Euryarchaeota are more closely-related to each other than they are to other Archaea to match the findings of previous Archaeal phylogenies 4,5 .Similarly, Gracilicutes are constrained to be more closely related to each other than to Terrabacteria or Archaea and vice versa based on the findings of [6][7][8][9][10][11][12][13] .4,15 ) produces conflicting evidence on the monophyly of Terrabacteria, we chose to constrain them here because tree certainty metrics have found that artefacts in the form of long-branch attraction towards Archaea and uneven taxon sampling can split the Terrabacterial phylum into separate parts 11 .CPR are assumed to be closely-related to Chloroflexi and Dormibacteriota within the Terrabacteria despite some conflicting evidence (e.g. 9,10,14) because their placement outside of Terrabacteria has also been found to be a result of uneven taxon sampling [6][7][8]11 .

Figure 2 :
Figure 2: Uncertainty in estimating the origin of PNaS.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 3 :
Figure 3: Uncertainty in estimating the origin of phnJ.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 4 :
Figure 4: Uncertainty in estimating the origin of phnM.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 5 :
Figure 5: Uncertainty in estimating the origin of phnZ.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 6 :
Figure 6: Uncertainty in estimating the origin of phnX.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 7 :
Figure 7: Uncertainty in estimating the origin of phnA.Horizontal lines represent the lengths of branches where horizontal gene transfers (pink) are predicted to have occurred on internal branches of the tree of life.The midpoint of each branch is marked with filled circles of the same colour.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 8 :
Figure 8: Uncertainty in estimating the origin of phnW.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 9 :
Figure 9: Uncertainty in estimating the origin of ppd.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 10 :
Figure 10: Uncertainty in estimating the origin of pepM.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 11 :
Figure 11: Uncertainty in estimating the origin of ptxD.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 12 :
Figure 12: Uncertainty in estimating the origin of ptxB.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 13 :
Figure 13: Uncertainty in estimating the origin of htxB.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 14 :
Figure 14: Uncertainty in estimating the origin of htxA.Horizontal lines represent the lengths of branches where gene duplications (green), horizontal gene transfers (pink) and losses (orange) are predicted to have occurred.The midpoint of each branch is marked with shapes of the same colour, representing duplications (filled squares), horizontal gene transfers (filled circles) and losses (empty triangles).Darkness indicates whether the event occurred on an internal (dark colour) or terminal (faded) branch of the tree of life.Gene speciations (blue asterisks) are not associated with branch lengths because they occur on internal nodes of the tree.Results found using three different clock models are shown (cir: Cox-Ingersoll-Ross, ln: lognormal, ugam: uncorrelated gamma multipliers).

Figure 15 :
Figure 15: Effect of higher costs for horizontal gene transfer (HGT) on the estimated timing of evolution of phosphorus-cycling enzymes.HGT cost is varied from the default of 3 (a) to 4 (b) and 6 (c) whilst retaining the default costs of gene duplication, loss, and speciation.These reconciliations were calculated using results of molecular clocks made with the CIR clock model.Colours represent duplications (squares), transfers (circles),

Figure 16 :
Figure 16: Effect of lower HGT costs on the estimated timing of the evolution of phosphorus-cycling enzymes.HGT cost is reduced from the default of 3 (a) down to 2 (b) whilst retaining the default costs of gene duplication, loss, and speciation.These reconciliations were calculated using results of molecular clocks made with the CIR clock model.Colours represent duplications (squares), transfers (circles), losses (triangles) and speciations (asterisks) of genes associated with microbial phosphate import (yellow), phosphonate catabolism (pink), phosphonate production (green), phosphite import and oxidation (blue) and hypophosphite import and oxidation (black).Faded colours represent the midpoint of events which occur on terminal branches, whilst bright colours represent the midpoint of events which occur on internal branches.Mya, million years ago; GOE, great oxygenation event.

Figure 17 :
Figure 17: Geochemical estimates of phosphate concentrations in the Archaean ocean.Note how phosphate concentrations are indicated on a logarithmic x axis, made with base 10.

Figure 18 :
Figure 18: Phylogenetic trees representing where speciations (numbered circles), duplications, transfers and losses (thick lineages) of pstS (right) and pnas (left) occurred in the tree of life based on three different molecular clock models.Speciations are numbered in chronological order which Archaean events highlighted in black circles, and others in grey circles.The thickness of each branch corresponds to the number of duplications, transfers and losses that occurred on the branch, whereas branch colours represent which phyla the lineage belongs to.

Figure 19 :
Figure 19: Phylogenetic trees representing where speciations (black numbered circles), duplications, transfers and losses (thick lineages) of phnJ (left) and phnM (right) occurred in the tree of life based on three different molecular clock models.Speciations are numbered in chronological order.The thickness of each branch corresponds to the number of duplications, transfers and losses that occurred on the branch, whereas branch colours represent which phyla the lineage belongs to.

Figure 20 :
Figure 20: Constraints applied to the tree of life and molecular clock.Coloured text next to wedges describes which phyla they contain, including Archaea (red and orange), bacteria of the Gracilicutes clade (blue) and bacteria of the Terrabacteria clade (green).The constraint ensures that Euryarchaeota are more closely-related to each other than they are to other Archaea to match the findings of previous Archaeal phylogenies4,5 .Similarly, Gracilicutes are constrained to be more closely related to each other than to Terrabacteria or Archaea and vice versa based on the findings of[6][7][8][9][10][11][12][13] .Although previous research (e.g.[8][9][10]14,15 ) produces conflicting evidence on the monophyly of Terrabacteria, we chose to constrain them here because tree certainty metrics have found that artefacts in the form of long-branch attraction towards Archaea and uneven taxon sampling can split the Terrabacterial phylum into separate parts 11 . CPR ae assumed to be closely-related to Chloroflexi and Dormibacteriota within the Terrabacteria despite some conflicting evidence (e.g.9,10,14 ) because their placement outside of Terrabacteria has also been found to be a result of uneven taxon sampling[6][7][8]11  .

Figure 22 :
Figure 22: Identification of pstS orthologs.Maximum-likelihood phylogeny of pstS homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars and pink stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 23 :
Figure 23: Identification of pnas orthologs.Maximum-likelihood phylogeny of pnas homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 24 :
Figure 24: Identification of phnJ orthologs.Maximum-likelihood phylogeny of phnJ homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 25 :
Figure 25: Identification of phnM orthologs.Maximum-likelihood phylogeny of phnM homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 26 :
Figure 26: Identification of phnZ orthologs.Maximum-likelihood phylogeny of phnZ homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 27 :
Figure 27: Identification of phnX orthologs.Maximum-likelihood phylogeny of phnX homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 28 :
Figure 28: Identification of phnA orthologs.Maximum-likelihood phylogeny of phnA homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 29 :
Figure 29: Identification of phnW orthologs.Maximum-likelihood phylogeny of phnW homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.

Figure 30 :
Figure 30: Identification of ppd orthologs.Maximum-likelihood phylogeny of ppd homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.

Figure 31 :
Figure 31: Identification of pepM orthologs.Maximum-likelihood phylogeny of pepM homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 32 :
Figure 32: Identification of ptxD orthologs.Maximum-likelihood phylogeny of ptxD homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 33 :
Figure 33: Identification of ptxB orthologs.Maximum-likelihood phylogeny of ptxB homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 34 :
Figure 34: Identification of htxB orthologs.Maximum-likelihood phylogeny of htxB homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange circles) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 35 :
Figure 35: Identification of htxA orthologs.Maximum-likelihood phylogeny of htxA homologs, with bitscores represented by coloured outlines surrounding each branch.Orthologs (orange stars) were identified and selected for further analyses based on their bitscore and relationship to query sequences (yellow stars) from the respective HMM profiles.White circles indicate ultrafast bootstrap support values => 95.

Figure 36 :
Figure 36: Differences between estimating the origin of phnX using cir (A), ln (B) and ugam (c) with Bayesian and Maximum Likelihood methodology.Bayesian results (orange) were collected using MrBayes to reconstruct the phnX phylogeny, whereas maximum likelihood results (cyan) were collected using IQ-TREE to reconstruct the phnX phylogeny.Horizontal lines represent the lengths of internal (dark colours) and terminal (pale colours) branches on which gene duplications, transfers and losses are predicted to have occurred.

Table 1 : Phosphorus-cycling genes analysed in this study.
Number of homologs is presented alongside the number of genomes which encode each gene and the number of speciations (spe), horizontal gene transfers (hgt), duplications (dup) and losses (los) predicted by ecceTERA using CIR, LN, and UGAM clock models with default event costs.

Table 2 : Firmicutes and Deltaproteobacteria diverged from one another in the Paleo-or Eo-archean.
Estimated divergence times for the MRCA of Firmicutes and Deltaproteobacteria in molecular clocks constructed with three different models are presented.Ga, billion years ago.

Table 4 : Calibration points used in molecular clock analyses.
We stress that these are conservative estimates, as the origin of a metabolism or group of organisms may predate its first widely-accepted expression in the rock record.

Table 5 : Source of HMM profiles used to identify phosphonate cycling genes.
Each was downloaded from an equivalog HMM used in the NCBI's prokaryotic genome annotation pipeline.